@Mozoloa commented Jan 17, 2026

This PR adds optional GPU acceleration to the covariance matrix decomposition and rotation-to-quaternion conversion, providing significant performance improvements for CUDA users while maintaining backward compatibility.

Changes

linalg.py

  • Add use_gpu parameter to quaternions_from_rotation_matrices() (default: True)
  • Add pure PyTorch GPU implementation using Shepperd's method
  • Original scipy CPU implementation preserved as fallback
  • Quaternion conversion is ~300x faster than the scipy path for large batches (2M+ gaussians)
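
For readers unfamiliar with Shepperd's method, here is a minimal batched sketch in PyTorch. It is not the code from this PR: the function name `rotmat_to_quat_shepperd` and the (w, x, y, z) output order are assumptions made for illustration (scipy's default order is scalar-last). The idea is to build four candidate quaternions, one per possible pivot, and select for each matrix the candidate whose pivot term is largest, which avoids dividing by a near-zero value.

```python
import torch


def rotmat_to_quat_shepperd(R: torch.Tensor) -> torch.Tensor:
    """Convert (N, 3, 3) rotation matrices to (N, 4) unit quaternions.

    Output order is (w, x, y, z); this ordering is an assumption of the sketch.
    """
    m00, m01, m02 = R[:, 0, 0], R[:, 0, 1], R[:, 0, 2]
    m10, m11, m12 = R[:, 1, 0], R[:, 1, 1], R[:, 1, 2]
    m20, m21, m22 = R[:, 2, 0], R[:, 2, 1], R[:, 2, 2]
    trace = m00 + m11 + m22

    # Each row is the true quaternion scaled by 4 * (pivot component),
    # so normalizing any row recovers the quaternion up to sign.
    candidates = torch.stack(
        [
            torch.stack([1 + trace, m21 - m12, m02 - m20, m10 - m01], dim=-1),
            torch.stack([m21 - m12, 1 + m00 - m11 - m22, m01 + m10, m02 + m20], dim=-1),
            torch.stack([m02 - m20, m01 + m10, 1 - m00 + m11 - m22, m12 + m21], dim=-1),
            torch.stack([m10 - m01, m02 + m20, m12 + m21, 1 - m00 - m11 + m22], dim=-1),
        ],
        dim=1,
    )  # shape (N, 4, 4)

    # Shepperd's rule: pivot on the largest of (trace, m00, m11, m22).
    choice = torch.stack([trace, m00, m11, m22], dim=-1).argmax(dim=-1)
    rows = torch.arange(R.shape[0], device=R.device)
    q = candidates[rows, choice]
    q = q / q.norm(dim=-1, keepdim=True)

    # Pick a canonical sign (w >= 0), since q and -q encode the same rotation.
    return torch.where(q[:, :1] < 0, -q, q)
```

Everything here is plain tensor indexing and arithmetic, so it runs on whichever device `R` lives on, with no round trip through NumPy.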

gaussians.py

  • Add use_gpu parameter to decompose_covariance_matrices() (default: True)
  • GPU path: SVD on GPU + vectorized reflection correction
  • CPU path: original float64 behavior preserved for maximum precision
  • Automatic fallback to CPU if GPU SVD fails
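
The GPU path and its fallback might look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's implementation: the name `decompose_covariance_sketch`, the choice to flip the last column of `U`, and catching `RuntimeError` as the failure signal are all hypothetical. For a symmetric positive semi-definite covariance, the SVD gives Σ = U·diag(s)·Uᵀ, so flipping the sign of one column of `U` wherever det(U) < 0 turns a reflection into a proper rotation without changing the reconstructed covariance.

```python
import torch


def decompose_covariance_sketch(cov: torch.Tensor, use_gpu: bool = True):
    """Split (N, 3, 3) covariances into rotations (N, 3, 3) and scales (N, 3).

    Hypothetical helper for illustration; names and details are assumptions.
    """

    def _decompose(c: torch.Tensor):
        U, S, _ = torch.linalg.svd(c)
        # Vectorized reflection correction: where det(U) = -1, negate the
        # last column so every U becomes a proper rotation (det = +1).
        det = torch.linalg.det(U)
        flip = torch.ones_like(S)
        flip[:, -1] = torch.where(det < 0, -torch.ones_like(det), torch.ones_like(det))
        U = U * flip[:, None, :]
        # Singular values of the covariance are the squared scales.
        return U, S.clamp_min(0).sqrt()

    if use_gpu and cov.is_cuda:
        try:
            return _decompose(cov)
        except RuntimeError:
            pass  # assumed failure mode: fall through to the CPU path

    # CPU path in float64 for precision, then cast back to the input's
    # device and dtype so callers see a consistent return type.
    R, s = _decompose(cov.detach().cpu().to(torch.float64))
    return (
        R.to(device=cov.device, dtype=cov.dtype),
        s.to(device=cov.device, dtype=cov.dtype),
    )
```

Because the correction is a single masked multiply over the whole batch, no per-matrix Python loop is needed.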

Performance

Tested on RTX 4090 with ~700k gaussians per frame:

  • Before: ~4.0s per frame (~3s of which was quaternion conversion on CPU)
  • After: ~1.0s per frame
  • 4x overall speedup

The bottleneck was scipy.spatial.transform.Rotation.from_matrix(), which requires a transfer to the CPU and a conversion to NumPy arrays. The new GPU implementation keeps the data on the device throughout.
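
To reproduce per-call timings like these, note that CUDA kernels launch asynchronously, so the device has to be synchronized before reading the clock. The helper below is a hypothetical sketch (the name `time_per_call` and its parameters are assumptions, not part of this PR) showing one way to measure that.

```python
import time

import torch


def time_per_call(fn, *args, repeats: int = 5) -> float:
    """Return the mean wall-clock seconds per call of fn(*args)."""
    use_cuda = torch.cuda.is_available()

    fn(*args)  # warm-up call so one-time setup cost is not measured
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work to finish

    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # make sure all timed kernels completed
    return (time.perf_counter() - start) / repeats
```

Without the synchronization calls, a GPU measurement would mostly capture kernel launch overhead rather than the actual compute time.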

Backward Compatibility

  • Default behavior unchanged for CPU tensors
  • Set use_gpu=False to force original CPU behavior
  • API is fully backward compatible (new parameter has default value)
